Competitive Group Testing and Learning Hidden Vertex Covers with Minimum Adaptivity

Authors

  • Peter Damaschke
  • Azam Sheikh Muhammad
Abstract

Suppose that we are given a set of n elements, d of which are “defective”. A group test can check for any subset, called a pool, whether it contains a defective. It is well known that d defectives can be found by using O(d log n) pools. This nearly optimal number of pools can be achieved in 2 stages, where tests within a stage are done in parallel. But then d must be known in advance. Here we explore group testing strategies that use a nearly optimal number of pools and a few stages although d is not known to the searcher. One easily sees that O(log d) stages are sufficient for a strategy with O(d log n) pools. Here we prove a lower bound of Ω(log d / log log d) stages and a more general pools vs. stages tradeoff. As opposed to this, we devise a randomized strategy that finds d defectives using O(d log(n/d)) pools in 3 stages, with any desired probability 1 − ε. Open questions concern the optimal constant factors and practical implications. A related problem, motivated by, e.g., biological network analysis, is to learn hidden vertex covers of small size k in unknown graphs by edge group tests. (Does a given subset of vertices contain an edge?) We give a 1-stage strategy using O(k log n) pools, with any FPT algorithm for vertex cover enumeration as a decoder.

1 Background and Contributions

The group testing problem is to find d elements, called positive (or synonymously, defective), in a set X of size n by queries of the following type. The searcher can choose arbitrary subsets Q ⊂ X called pools, and ask whether Q contains at least one defective. Group testing has several applications, most notably in biological and chemical testing. Throughout this paper, log means log2 if no other base is mentioned. Nondefective elements are called negative. A positive pool is a pool containing some defective, thus responding Yes to a group test. A negative pool is a pool without defectives, thus responding No to a group test.
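For experimenting with such strategies, the query model is easy to simulate. A minimal sketch (the function name and the toy instance are illustrative, not from the paper):

```python
def make_oracle(defectives):
    """Return a group-test function for a hidden set of defectives.
    A pool answers Yes (True) iff it contains at least one defective."""
    hidden = set(defectives)

    def query(pool):
        return any(x in hidden for x in pool)

    return query

# A pool containing defective 3 is positive; a pool of negatives is not.
query = make_oracle({3, 17})
```

The searcher only ever sees the Yes/No answers, never the hidden set itself.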
By the information-theoretic lower bound, at least log (n choose d) ≈ d log(n/d) pools are needed to find d defectives, even if the number d is known in advance, and it is an easy exercise to devise an adaptive query strategy using O(d log(n/d)) pools. Here, a strategy is called adaptive if queries are asked sequentially, that is, every pool can be prepared based on the outcomes of all earlier queries. For many applications, however, the time consumption of adaptive strategies is hardly acceptable, and strategies that work in a few stages are strongly preferred: The pools for every stage must be prepared in advance, depending on the outcomes of earlier stages, and then they are queried in parallel. It is well known that 1-stage strategies need Ω(d² log n / log d) pools, and O(d² log n) pools are sufficient. The currently best factor is 4.28; see [7] and the references there. The first 2-stage strategy using a number of pools within a constant factor of optimum, more precisely 7.54 d log(n/d), was developed in [10] and later improved to essentially 4 d log(n/d) [13] and finally 1.9 d log(n/d), or even 1.44 d log(n/d) for large enough d [7]. These strategies use stage 1 to find O(d) candidate elements including all defectives, which are then tested individually in stage 2. The 2-stage strategies still require the knowledge of an upper bound d on the number of defectives, and they guarantee an almost optimal query complexity only relative to this d, which can be much larger than the true number of defectives in the particular case. As opposed to this, adaptive strategies with O(d log(n/d)) pools do not need any prior knowledge of d. Beginning with [3, 11, 12], substantial work has been done to minimize the constant factor in O(d log(n/d)), called the competitive ratio. The currently best results are in [16]. Our problem with unknown d was also raised in [14], and several batching strategies have been proposed and studied experimentally.
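The easy adaptive strategy alluded to above can be sketched as follows: halve a positive pool until one defective is isolated, remove it, and stop once the remainder tests negative. This is only an illustration (roughly d(log n + 2) queries, not the optimized competitive strategies cited); the `query` oracle is assumed to report whether a pool contains a defective.

```python
def find_defectives(universe, query):
    """Adaptive group testing with roughly d*log2(n) pools:
    binary-search one defective at a time among the unresolved elements."""
    remaining = list(universe)
    found = set()
    while remaining and query(remaining):      # remainder still positive?
        pool = remaining
        while len(pool) > 1:                   # halve until one element is left
            half = pool[:len(pool) // 2]
            # A positive pool keeps at least one defective in one of its halves.
            pool = half if query(half) else pool[len(pool) // 2:]
        found.add(pool[0])
        remaining.remove(pool[0])
    return found
```

Each defective costs about log2(n) + 1 queries, plus one final negative query, which is far below the n individual tests a 1-stage strategy would need.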
To the best of our knowledge, the present paper is the first to establish rigorous results for this question: Can we take the best of two worlds and perform group testing without prior knowledge of d in a few stages, using a number of pools close to the information-theoretic lower bound? This question is not only of theoretical interest. If the number d of defectives varies a lot between the problem instances, then the conservative policy of assuming some “large enough” d systematically requires unnecessarily many tests, while a strategy with an underestimated d even fails to find all defectives. It is fairly obvious that a 1-stage strategy cannot do better than n individual tests. On the bright side, O(log d) stages are sufficient to accommodate a strategy with O(d log(n/d)) pools: Simply double the assumed d in every other stage, and apply the best 2-stage strategy repeatedly, including a check whether all defectives have been found. In this paper we prove that any deterministic strategy that insists on O(d log n) pools needs s = Ω(log d / log log d) stages in the worst case. This clearly separates the complexity of the cases with known and unknown d. By the same proof technique we show tradeoff lower bounds for pools and stages. In particular, the number of pools in deterministic strategies with constantly many stages cannot be limited to any function f(d) log n. Whereas the proof idea is a standard “version space” argument counting the number of consistent hypotheses, the details of the adversary strategy and counting process are not obvious. We explore a hypergraph representation of the query results. There remains a log log d gap between our current bounds. We conjecture that our proof can be refined to give a matching Ω(log d) lower bound. The next result shows the power of randomization: We propose a Las Vegas strategy that uses O(d log n) pools in only 3 stages and succeeds with any prescribed constant probability, arbitrarily close to 1.
Obviously, the only thing we need is a good upper bound on d, because then we can apply any known 2-stage strategy with O(d log n) pools, using our bound instead of the unknown actual d. And such an estimate for d is obtained by O(log n) randomized pools in stage 1. Once more, the principal idea is simple (we use pools of exponentially growing size and guess d based on the query outcomes), but the practical challenge is to achieve low constant factors in the total query number O(d log n). Similarly, by using O(d log n) pools we need only 2 stages, with arbitrarily high constant probability. Note that we can always recognize in the last stage whether all defectives have been found (and d was not underestimated), by one extra query to the complement of the candidate set. In the unlikely negative case we can simply repeat the strategy with a somewhat larger bound d, hence we eventually find all defectives in a constant expected number of stages and within the same asymptotic query complexity. An open question is whether O(d log n) randomized pools in 2 stages are sufficient. Related to this discussion, one may wonder if determining the exact number of defectives by group tests is perhaps easier than actually identifying the defectives. Note that in applications like environmental testing we may only be interested in the amount of contamination of samples, rather than in individual items. However, our lower-bound proof yields as a byproduct that the complexity is the same. In related work [8] we studied query strategies and the computational complexity of learning Boolean functions depending on only a few unknown relevant variables. Group testing is the special case where the Boolean function is already known to be the disjunction of the relevant variables. One modern application of group testing is the reconstruction of biological networks, e.g., protein interaction networks, by experiments that signal the presence of at least one interaction in a “pool” of proteins. 
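One way to realize the stage-1 estimate is sketched below (the doubling pool sizes, the repetition count, and the max rule are illustrative choices, not the paper's tuned constants): query random pools of exponentially growing size; pools much smaller than n/d are likely negative and pools much larger are likely positive, so the smallest positive pool size locates d up to a constant factor.

```python
import random

def estimate_defectives(universe, query, reps=10, rng=random):
    """Estimate the number d of defectives with O(reps * log n) pools.
    All pools can be drawn up front, so this fits into a single stage."""
    items = list(universe)
    n = len(items)
    estimates = []
    for _ in range(reps):
        size = 1
        while True:
            pool = rng.sample(items, min(size, n))
            if query(pool):
                # A random pool of size s first turns positive around s ~ n/d.
                estimates.append(max(1, n // min(size, n)))
                break
            if size >= n:
                estimates.append(0)   # even the full set was negative: d = 0
                break
            size *= 2
    # The maximum biases the estimate upward: the later stages need an
    # upper bound on d that holds with high probability.
    return max(estimates)
```

Repetitions drive the failure probability (underestimating d) below any prescribed constant, at the price of a constant factor in the number of pools.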
If a group test is available that signals the interaction of one fixed protein, called a bait, with a pool of other proteins, then the problem of finding all interaction partners of a bait is just the group testing problem. Since the degrees d of vertices in interaction networks are very different and tests are time-consuming, we arrive at exactly the problem setting considered in this paper. Instead of learning a whole graph, i.e., the neighbors of every vertex, we may want to learn only a small set of vertices that is incident to all edges, that is, a small vertex cover. In interaction networks such sets can be expected to play a major role, as a small vertex cover represents, e.g., a small group of proteins involved in all interactions [15]. Suppose that an edge group test is available that tells, for a pool Q of vertices, whether some vertices in Q are joined by an edge. This assumption is also known as the complex model of group testing. Then we encounter the problem of learning a hidden vertex cover: Given a graph with a known vertex set but an unknown edge set, and a number k, identify a vertex cover of size at most k (or all of them), using a possibly small number of edge group tests. Learning hidden structures in graphs has been intensively studied for many structures and query models; we refer to [1, 2, 4] for recent results and a survey. Learning a hidden star [1] is a related but quite different problem. Note that the vertex cover problem is NP-complete already for “known” graphs; on the other hand, it is a classical example of a fixed-parameter tractable (FPT) problem: It can be solved in O(b^k p(n)) time, with some constant base b and some fixed polynomial p. In a sense we extend the classical FPT result and show that hidden vertex covers can be learned efficiently and nonadaptively if k is small.

Organization of the paper: In Section 2 we derive a lower bound tradeoff for stages vs. pools in deterministic group testing strategies when d is not given to the searcher.
Section 3 presents a randomized strategy for estimating the number of defectives, leading to, e.g., a randomized competitive 3-stage group testing strategy. In Section 4 we give our FPT-style result for learning hidden vertex covers. Section 5 discusses potentially interesting questions for further research. In order to emphasize the main ideas, and also due to space limitations, we have omitted technicalities in several proofs, but in principle the proofs are complete.

2 A Lower Bound for Adaptivity in Competitive Deterministic Group Testing

In this section we give an adversarial answer strategy that forces a certain minimum number of stages upon a searcher who wants to keep the number of pools restricted. Consider a set X of elements, containing an unknown subset of defectives.

Definition 1. Given a set P of pools, the response vector t assigns every positive (negative) pool the value 1 (0). Let P+ and P− be the sets of positive and negative pools, respectively. The response hypergraph RH(P, t) has the vertex set V := X \ ⋃_{Q∈P−} Q, and every Q ∈ P+ is turned into a hyperedge Q ∩ V of RH(P, t).

Intuitively that means: The response vector just describes the outcome of a group testing experiment on the set P of pools. The vertices of RH(P, t) are all elements that appear in no negative pool. The hyperedges of RH(P, t) are the positive pools restricted to these vertices; that is, all elements recognized as negative are removed. A hitting set of a hypergraph is a set of vertices that intersects every hyperedge. Note that a superset of a hitting set is a hitting set, too. From the definitions it follows immediately:

Lemma 1. Given a response vector t, the family of possible sets of defectives, i.e., those consistent with t, is exactly the family of hitting sets of RH(P, t). □

Before we state our adversary strategy in detail, we outline its structure. Consider any deterministic group testing strategy that works in stages.
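Definition 1 and Lemma 1 translate directly into a few lines of code. A small sketch (assuming pools are given as sets together with their Yes/No response bits; the names are mine):

```python
def response_hypergraph(X, pools, responses):
    """Build RH(P, t): drop every element occurring in a negative pool;
    each positive pool, restricted to the surviving vertices, becomes
    a hyperedge."""
    negatives = [set(p) for p, ans in zip(pools, responses) if not ans]
    positives = [set(p) for p, ans in zip(pools, responses) if ans]
    vertices = set(X) - set().union(*negatives)
    hyperedges = [p & vertices for p in positives]
    return vertices, hyperedges

def consistent(candidate, vertices, hyperedges):
    """Lemma 1: a candidate defective set is consistent with the responses
    iff it avoids all negative pools and hits every hyperedge."""
    c = set(candidate)
    return c <= vertices and all(c & e for e in hyperedges)
```

For example, with pools {0,1,2} (positive), {2,3} (negative), {4,5} (positive) on X = {0,…,5}, the vertex set is {0,1,4,5} and the hyperedges are {0,1} and {4,5}; {0,4} is a consistent defective set, while {2,4} is not.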
The main idea of our adversary strategy is to answer the queries in every stage in such a way that RH(P, t) has some hitting set that is much smaller than the vertex set. This leaves the searcher uncertain about the status (positive or negative) of all the other vertices in RH(P, t). Note that an adversary working against a deterministic searcher can hide defectives after having seen the pools. The second idea is a standard technical trick used in many lower-bound proofs to simplify the analysis: The adversary may cautiously reveal some extra information. Specifically, our adversary tells the searcher a subset of defectives that already forms a hitting set of RH(P, t). The effect is that all hyperedges of RH(P, t) are now “explained” by the revealed defectives, thus RH(P, t) does not contain any further useful information for the searcher. Hence the searcher can even totally forget the hypergraph, and the searcher’s knowledge is represented by two sets: the already known defectives, and the elements whose status is yet unknown; each of the latter elements can be (independently!) positive or negative. We will play with the cardinalities of these two sets and make the searcher’s life as hard as possible.

Specifically: Let f be any monotone increasing function and d the true number of defectives. Suppose that the searcher is aiming for at most f(d) log n queries in total. Let us consider the moment prior to any stage. Suppose that k defectives are already known and u elements are yet undecided. As we might have d = k, the searcher can prepare a set P of at most f(k) log n pools for the next stage. (Actually, the number of pools already used up in earlier stages must be subtracted, which makes the limit even lower, but our analysis does not take advantage of this fact.) These queries in P can generate at most 2^(f(k) log n) = n^(f(k)) different response vectors. The adversary chooses some number h ≤ u and announces that h or more further elements are also defective.
In particular, there exist (u choose h) possible sets of exactly h further defectives. By the pigeonhole principle, some family T of at least (u choose h)/n^(f(k)) of these candidate sets generates the same (consistent) response vector t. Now the adversary answers with just this response vector t. Let Y denote the union of all sets in T.

Lemma 2. Y is entirely contained in the vertex set of RH(P, t).

Proof. Assume that some q ∈ Y is in some pool Q which is negative in t, that is, t(Q) = 0. By the definition of Y, element q also belongs to some Z ⊆ Y such that response vector t is generated if Z is the actual set of defectives. This contradicts Q ∩ Z = ∅. □

Define y = |Y|. Finally, the adversary actually names a set H of h new defectives in Y, in compliance with t. By Lemma 1, H is a hitting set of RH(P, t). Since arbitrary supersets of H are hitting sets, too, and Y is included in RH(P, t) by Lemma 2, it follows that H plus any of the y − h elements of Y \ H builds a hitting set of RH(P, t). Using Lemma 1 again, we conclude that the y − h elements of Y \ H may still be defective or not, independently of each other. Since Y must contain at least (u choose h)/n^(f(k)) different subsets of size h, we get the following chain of inequalities:
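The first links of that chain follow from the counts already established: at most f(k) log n pools admit at most 2^(f(k) log n) response vectors, and every set in T is an h-subset of Y. In LaTeX form (my transcription of this opening step, using the pool budget f(k) log n from above):

```latex
\binom{y}{h} \;\ge\; |T| \;\ge\; \binom{u}{h} \Big/ 2^{f(k)\log n}
            \;=\; \binom{u}{h} \Big/ n^{f(k)} .
```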


Similar resources

On the Distributed Decision-Making Complexity of the Minimum Vertex Cover Problem

In this paper we study the problem of computing approximate vertex covers of a graph on the basis of partial information in the distributed decision-making model proposed by Deng and Papadimitriou [1]. In particular, we show an optimal algorithm whose competitive ratio is equal to p, where p is the number of processors.


Structural Knowledge Discovery in Chemical and Spatio-Temporal Databases

Most current knowledge discovery systems use only attribute-value information. But relational information between objects is also important to the knowledge hidden in today’s databases. Two such domains are chemical structures and domains where objects are related in space and time. Inductive Logic Programming (ILP) discovery systems handle relational data, but require data to be expressed as a...


Crown reductions for the Minimum Weighted Vertex Cover problem

The paper studies crown reductions for the Minimum Weighted Vertex Cover problem introduced recently in the unweighted case by Fellows et al. ([20], [1]). We describe in detail a close relation of crown reductions to Nemhauser and Trotter reductions that are based on the linear programming relaxation of the problem. We introduce and study the so called strong crown reductions, suitable for find...


Vertex Covers and Connected Vertex Covers in 3-connected Graphs

A vertex cover of a graph G=(V,E) is a subset N of V such that each element of E is incident upon some element of N, where V and E are the sets of vertices and of edges of G, respectively. A connected vertex cover of a graph G is a vertex cover of G such that the subgraph G[N] induced by N of G is connected. The minimum vertex cover problem (VCP) is the problem of finding a vertex cover of mini...


Two Minimum Dominating Sets with Minimum Intersection in Chordal Graphs

We prove that the problem of finding two minimum dominating sets (connected dominating sets or vertex clique covers) with minimum intersection is linearly solvable in interval graphs. Furthermore, the problem of deciding whether or not there exist two disjoint minimum dominating sets (connected dominating sets or vertex clique covers) is shown to be NP-hard for chordal graphs.



Publication date: 2009